Learning Classifiers from Synthetic Data Using a Multichannel Autoencoder

Xi Zhang, Yanwei Fu, Andi Zang, Leonid Sigal, Gady Agam 


Abstract: We propose a method for using synthetic data to help learn classifiers. Synthetic data, even when
generated from real data, normally exhibits a shift from the distribution of real data in feature space.
To bridge the gap between real and synthetic data, and to learn jointly from both, this paper proposes a
Multichannel Autoencoder (MCAE). We show that by using MCAE, it is possible to learn a better feature
representation for classification. To evaluate the proposed approach, we conduct experiments on two types
of datasets. The experimental results validate the effectiveness of our MCAE model and of our methodology
for generating synthetic data.
1 Introduction

Large and balanced datasets are normally crucial for learning classifiers. In real-world
scenarios, however, one often struggles to find adequate amounts of labeled data. Even with
the help of crowdsourcing, e.g., Amazon Mechanical Turk (AMT), it is often difficult to
collect the large quantity of high-quality labeled instances needed to train a classifier
for a real-world problem. In terms of quantity, it has been shown that the amount of
available training data per object class roughly follows a Zipf distribution [35]. That
means a small number of object classes account for most of the available training
instances. In terms of quality, some domains, such as the analysis of satellite images
(e.g. the comet images from Rosetta), require extensive and detailed expert annotation
[32], [48]. Large volumes of LiDAR point cloud data have to be labeled before they can be
used to train some classifiers [49]. Such a labeling process is usually very time consuming
and requires expert-level labeling effort or expensive equipment. In practice, only a very
limited portion of the data points can be obtained.

To address the lack of training samples, attributes [22], [30], [13] have been introduced
to transfer the knowledge held by majority classes to
instances in minority classes. Nevertheless, for certain
tasks, such shared attributes HU, El, IS, m, EH 
may simply be unavailable or nontrivial to define. In 
contrast, rather than using such a 'learning to learn' [38] framework, humans can
generalize and associate similar patterns across images. This ability inspires us to
circumvent the lack of training data and to approach the problem
from a different angle: utilizing synthetic
Figure 1. Examples of (a) real roof edge images vs. (b) the corresponding
synthetic roof edge images. The synthetic data are generated by the
algorithms in Sec. 4. The examples are randomly drawn from the SRC dataset.

data (e.g. the synthetic roof edges in Fig. 1) together
with real data (e.g. the real roof edges in Fig. 1) in order to
learn a better classifier.

The idea of associating synthetic data with real data has a long history and is
associated with the development of cognitive psychology, artificial intelligence,
and computer vision. For example, cognitive psychology has studied how an infant
learns to understand and imitate a facial expression from its parents' examples.
In the computing domain, exemplar SVM [26] tries to associate images with training
exemplars. Unlike these previous works, which associate 'new' real data with 'old'
real data, we create synthetic images and associate them with the real images. Our
approach is a 'free lunch' in the sense that it does not need any human annotation
of real data, so we can easily enlarge the dataset used in training.

Learning a classifier from synthetic data is unfortunately extremely challenging
for the following reasons. First, the feature distribution of the generated
synthetic data shifts away from that of the real data. We term this distribution
shift the synthetic gap and illustrate it in Fig. 2. The synthetic gap is a major
obstacle to using synthetic data to help learn classifiers, since synthetic
data may fail to simulate the potentially useful patterns
of real data for training classifiers. To our knowledge,
this synthetic gap problem has never been formally
Figure 2. t-SNE visualization of the synthetic gap using the
data from the SRC dataset: (a) synthetic gap between real and
synthetic data; (b) MCAE bridges the synthetic gap.


identified or addressed in the literature. Second, since in practice only
a small number of labeled images may be available, it is necessary to learn
jointly from synthetic and real data. The learning process must be
automatically balanced between synthetic data and real data.

To better learn a classifier from synthetic data, we propose a novel framework,
the Multichannel Autoencoder (MCAE), which is an extension of the sparse
autoencoder. The training step of MCAE bridges the synthetic gap between the
real and the synthetic data by learning the mappings from (1) synthetic to real
data and (2) real to real data. Critically, these mappings try to preserve the
real data while forcing MCAE to learn a transfer from the synthetic data to the
real data. We can thus generate more synthetic data, which will simulate the real
data once the learned mapping is applied to them.

To facilitate the study of satellite image analysis, we introduce a new benchmark
satellite roof classification (SRC) image dataset. The SRC dataset needs
expert-level labeling and has unique challenges, such as satellite image blurring,
building shadows, and extremely imbalanced roof class instances. To demonstrate
the generality of the proposed approach, we use an additional handwritten digit
dataset from the UCI machine learning repository [1]. In both datasets, synthetic
data are generated using a parametric model derived from real data that roughly
mimics the real data in terms of appearance and basic structure. Experimental
results on these datasets demonstrate that better classification results can be
obtained by training a classifier on synthetic data processed by the proposed
approach.

We thus highlight three contributions of this paper: (1) To the best of our
knowledge, this is the first attempt to address the problem of the synthetic gap;
by solving it, we demonstrate that synthetic data can be used to improve the
performance of classifiers. (2) We propose a Multichannel Autoencoder (MCAE)
model to bridge the synthetic gap and jointly learn from both real and synthetic
data. (3) A novel benchmark dataset, Satellite Roof Classification (SRC), is
introduced to the vision community. This dataset has expert-level label
annotations and poses great challenges for satellite image analysis.


2 Related Work 

3D image analysis. Synthetic data has been used for several 3D image analysis
applications, but not for helping learn classifiers. A large number of synthetic
3D meshes in m were created by a series of mesh editing steps including
subdivision, simplification, smoothing, adding noise and Poisson reconstruction,
in order to automatically evaluate the subjective visual quality of a 3D object.
Recently, to circumvent the point labeling difficulty in a building roof
classification problem using LiDAR point clouds, [49] explicitly indicated
semantic roof points on synthetically created roof point clouds and computed
point features from the synthetic point clouds. Techniques such as point cloud
resampling, size normalization and mesh erosion were employed to reduce the
differences between real roofs and synthetic ones in data space.

Generating synthetic data. Previous methods generate synthetic data in data space
using tools including geometrical transformations and degradation models. In MM,
to help off-line recognition of handwritten text, a perturbation model combined
with morphological operations is applied to real data. They showed that when a
moderate transformation is applied to the real data, the resulting synthetic
training set boosts performance. To enhance the quality of degraded documents, in
CO degradation models such as brightness degradation, blurring degradation, noise
degradation and texture-blending degradation were used to create a training
dataset for a handwritten text recognition problem. The synthetic minority
oversampling technique (SMOTE) [8] and its variants mm are also powerful methods
that have shown much success in various applications. However, these previous
methods are relatively limited to one particular type of dataset, whilst we
propose a more general methodology for generating synthetic data in this paper.
We show that our methodology can be used for both the SRC and the handwritten
digits datasets.

Transfer learning aims to extract knowledge from one or more source tasks and
apply it to a target task. Transfer learning has been found helpful in many
real-world problems, such as sentiment classification m, web page classification
[36] and zero-shot classification of image and video data [23], [22], fl~6l,
ESI, m, llZL, [33], [34], El. Transfer learning is categorized into three classes
[31]: inductive transfer learning, transductive transfer learning and
unsupervised transfer learning. The work in this paper falls into the framework
of domain adaptation EJ, EU within transductive transfer learning. Nonetheless,
unlike previous domain adaptation tasks with different source and target domains,
the synthetic gap is caused by the shift of the feature distribution of synthetic
data away from that of real data. To solve this problem, our MCAE is developed
from the idea of the autoencoder.

Autoencoder is one type of neural network whose output vectors have the same
dimensionality as the input vectors [42]. The hidden representation obtained by
training a sparse autoencoder followed by parameter fine-tuning is useful in
pre-training a deeper neural network. Recently, the autoencoder and its variants
m, gei have also been successful in learning and transferring shared knowledge
among data sources from different domains ®, B, OH, thus benefiting other
machine learning tasks.

3 Multichannel Autoencoder (MCAE) 

In this section, we introduce the MCAE model as illustrated in Fig. 3. It can
(1) bridge the synthetic gap by minimizing the discrepancy between real and
synthetic data; and (2) preserve and emphasize the potentially useful patterns
that exist in both real and synthetic data in order to generate better feature
representations for learning classifiers.

Since synthetic and real data should share similar patterns, a natural way to
bridge the synthetic gap is to learn a mapping from the synthetic data to the
real data using an autoencoder, and vice versa. MCAE provides a more flexible way
to learn this mapping because of its specific structure. There are two channels
in MCAE, a left one and a right one. Each channel is basically an SAE; however,
the two channels share the same hidden layer. With this structure, MCAE learns
two tasks at the same time. By setting different types of input and output data,
as in the denoising autoencoder [43], MCAE can serve many applications. In our
work, to bridge the gap between synthetic and real data, the left channel takes
synthetic data as input and real data as the reconstruction target, while the
right channel uses real data as both input and reconstruction target. This
configuration is meaningful because, by keeping the reconstruction target
identical in the two channels, MCAE attempts to transform the inputs of both
channels towards the same target, and thus minimizes the discrepancy between the
two input datasets, which are synthetic data and real data in our work.

3.1 Problem setup 

Our MCAE is built on the sparse autoencoder (SAE). A basic autoencoder is a fully
connected neural network with one hidden layer and can be decomposed into two
parts: an encoding and a decoding process. Assume an input dataset with $n$
instances $X = \{x_i\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^m$ and $m$ is the
dimension of each instance. Encoding typically transforms the input data to a
hidden layer representation using an affine mapping squashed by a sigmoid
function:

$$h_e(x_i) = f(W_e x_i + b_e) \qquad (1)$$

where $f(\cdot)$ is a sigmoid function and $\theta_e = \{W_e, b_e\}$, with
$W_e \in \mathbb{R}^{k \times m}$ and $b_e \in \mathbb{R}^k$, is the set of unknown
encoding parameters for a hidden layer with $k$ nodes.



Figure 3. (a) Illustration of the proposed MCAE model in a stacked autoencoder
structure, where black edges between two layers are shared by the two tasks,
while red and blue edges belong to the left and right task respectively.
(b) A zoomed-in view of the MCAE structure.

In decoding, with parameters $\theta_d = \{W_d, b_d\}$,
$W_d \in \mathbb{R}^{m \times k}$, $b_d \in \mathbb{R}^m$, the autoencoder attempts
to reconstruct the input data at the output layer by applying another affine
mapping followed by a nonlinearity to the hidden representation $h_e(x_i)$:

$$h_d(x_i) = f(W_d h_e(x_i) + b_d) \qquad (2)$$

In the above equation, $h_d(x_i)$ is viewed as a reconstruction of the input
$x_i$. Normally, we impose $h_d(x_i) \approx x_i$. Here $x_i$ plays the role of
the reconstruction target, and we use the notation $(i:x_i,\ t:x_i)$ to denote the
configuration of input data (abbreviated $i$) and reconstruction target
(abbreviated $t$) in an autoencoder. $X_s$ and $X_r$ indicate synthetic and real
data respectively. By minimizing the reconstruction errors of all data instances,
we have the following objective function:

$$J(\theta_e, \theta_d) = \frac{1}{n} \sum_{i=1}^{n} (h_d(x_i) - x_i)^2 + \lambda W \qquad (3)$$

where $W = (\sum W_e^2 + \sum W_d^2)/2$ is a weight decay term added to improve
generalization of the autoencoder and $\lambda$ weights the importance of this
term.

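To avoid learning an identity mapping in the autoencoder, a regularization term
$\Theta = \sum_{j=1}^{k} \delta \log\frac{\delta}{\hat{h}_j} + (1-\delta)\log\frac{1-\delta}{1-\hat{h}_j}$
that penalizes over-activation of the nodes in the hidden layer is added.
$\delta$ is a sparsity parameter set by the user, and
$\hat{h} = \frac{1}{n}\sum_{i=1}^{n} h_e(x_i)$ is the mean hidden activation, with
$\hat{h}_j$ its $j$-th component.

$$J(\theta_e, \theta_d) = \frac{1}{n} \sum_{i=1}^{n} (h_d(x_i) - x_i)^2 + \lambda W + \rho \Theta \qquad (4)$$

$\rho$ controls the sparsity of the representation in the hidden layer.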
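
The sparse autoencoder objective of Eq. (4) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function and argument names (`sae_objective`, `lam`, `rho`, `delta`) and the default values are assumptions, and the sparsity penalty is taken to be the usual KL-divergence form between the target sparsity and the mean hidden activation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sae_objective(X, T, W_e, b_e, W_d, b_d, lam=1e-4, rho=0.1, delta=0.05):
    """Sparse-autoencoder objective for inputs X and reconstruction targets T.

    Shapes: X, T are (n, m); W_e is (k, m); b_e is (k,); W_d is (m, k); b_d is (m,).
    lam, rho and delta are hypothetical defaults chosen for illustration.
    """
    H = sigmoid(X @ W_e.T + b_e)                    # Eq. (1): hidden representation, (n, k)
    R = sigmoid(H @ W_d.T + b_d)                    # Eq. (2): reconstruction, (n, m)
    recon = np.mean(np.sum((R - T) ** 2, axis=1))   # mean squared reconstruction error
    decay = 0.5 * (np.sum(W_e ** 2) + np.sum(W_d ** 2))   # weight decay term
    h_bar = np.clip(H.mean(axis=0), 1e-8, 1 - 1e-8)       # mean activation per hidden node
    sparsity = np.sum(delta * np.log(delta / h_bar)
                      + (1 - delta) * np.log((1 - delta) / (1 - h_bar)))
    return recon + lam * decay + rho * sparsity
```

Passing `T = X` gives the standard configuration in which the input is its own reconstruction target; passing a different `T` corresponds to a configuration whose target differs from its input.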

Note that directly applying a sparse autoencoder to our problem does not work
well. For example, we can train an autoencoder purely by placing synthetic data
in the input layer and real data in the output layer, denoted as
$(i:X_s,\ t:X_r)$, which however cannot bridge the synthetic gap in our problem.
Such reconstruction only complements the missing information in the synthetic data
from the real data, but not vice versa (see footnote 1).

A better representation should be reconstructed using the information from both
real and synthetic data simultaneously. Specifically, we aim at two tasks: one is
$(i:X_s,\ t:X_r)^L$, which reconstructs synthetic data towards real data, and the
other is $(i:X_r,\ t:X_r)^R$, which uses identical real data as input and
reconstruction target, where $(\cdot)^L$ and $(\cdot)^R$ indicate the left and
right channels of MCAE.


3.2 MCAE model 

We propose a multichannel autoencoder that uses a balance regularization to
balance the learning between two tasks, i.e. $(i:X_s,\ t:X_r)^L$ and
$(i:X_r,\ t:X_r)^R$. The structure of this new autoencoder is shown in Fig. 3. In
this structure, the tasks of the two channels share the same parameters
$\theta_e$ in the encoding process, which forces the autoencoder to reconstruct
the common structure of both tasks. In the decoding process, however, we divide
the autoencoder into two separate channels so that the two tasks have their own
parameters $\theta_d^L$ and $\theta_d^R$. Dividing the autoencoder into two
channels at the decoding layer enables more flexible control over the two tasks,
so the autoencoder can better leverage the common knowledge from the two tasks.

With two channels in the MCAE, we aim to minimize the reconstruction errors of the
two tasks together while taking into account the balance between the two
channels. The new objective function of the MCAE is:

$$E = J^L(\theta_e, \theta_d^L) + J^R(\theta_e, \theta_d^R) + \Psi \qquad (5)$$

where

$$\Psi = \frac{1}{2}\left(J^L(\theta_e, \theta_d^L) - J^R(\theta_e, \theta_d^R)\right)^2 \qquad (6)$$

is a regularization term added to balance the learning rate between the two
channels. This regularization has two effects on the MCAE. First, $\Psi$
accelerates the optimization of Eq. (5), since minimizing it requires both
$J^L(\theta_e, \theta_d^L)$ and $J^R(\theta_e, \theta_d^R)$ to be small, which in
turn causes $E$ to decrease faster. Second, $\Psi$ penalizes more heavily the
situation in which the difference in learning error between the two channels is
large, so as to avoid imbalanced learning between the two channels.

The minimization of Eq. (5) is achieved by back propagation and stochastic
gradient descent using a Quasi-Newton method. Since the regularization term is
added to balance the different tasks, we have to compute the gradients with
respect to the parameters $\theta_e$, $\theta_d^L$ and $\theta_d^R$ in MCAE.
Please refer to the supplementary material for the detailed computation of the
gradients.
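
To make the two-channel objective concrete, the following NumPy sketch evaluates the reconstruction error of each channel with a shared encoder and separate decoders, and then adds the balance term of Eq. (6). It is a simplified illustration under stated assumptions: the names `channel_error` and `mcae_objective` are hypothetical, the weight decay and sparsity terms of each channel are omitted for brevity, and the rows of `X_s` and `X_r` are assumed to be paired (each synthetic instance matched to its real instance).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_error(X, T, W_e, b_e, W_d, b_d):
    """Mean squared reconstruction error of one channel (input X, target T)."""
    H = sigmoid(X @ W_e.T + b_e)   # encoder shared by both channels
    R = sigmoid(H @ W_d.T + b_d)   # decoder specific to this channel
    return np.mean(np.sum((R - T) ** 2, axis=1))

def mcae_objective(X_s, X_r, W_e, b_e, dec_L, dec_R):
    """Return (E, J_L, J_R) with E = J_L + J_R + 0.5 * (J_L - J_R)^2.

    dec_L and dec_R are (W_d, b_d) tuples for the left and right decoders.
    """
    J_L = channel_error(X_s, X_r, W_e, b_e, *dec_L)  # left: synthetic input, real target
    J_R = channel_error(X_r, X_r, W_e, b_e, *dec_R)  # right: real input, real target
    psi = 0.5 * (J_L - J_R) ** 2                     # balance regularization
    return J_L + J_R + psi, J_L, J_R
```

When both channels incur the same error the balance term vanishes, and it grows quadratically as one channel falls behind the other, which is the behaviour the regularization is meant to discourage.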

1. Please refer to the supplementary material for the validation.


3.3 The advantages of MCAE over the alternative configurations

Our MCAE enforces the autoencoder to learn useful class patterns from the two
tasks simultaneously, and thus helps capture the common structure of synthetic
and real images. An alternative is to concatenate the inputs and targets of the
two tasks, $(i:X_s X_r,\ t:X_r X_r)$, for a single autoencoder. We refer to this
autoencoder as the Concatenate-Input Autoencoder (CIAE), since it learns the
concatenated tasks at the same time. Such a configuration, however, may result in
an unbalanced optimization of the two tasks: the optimization of one task can
dominate that of the other. This results in a biased hidden layer and thus
limited classification performance. Our experiments in Sec. 5 validate this
point.

4 Generating Synthetic Data 

How to generate synthetic data is an important yet less studied topic. This
section discusses the methodology for generating the synthetic data used in our
experiments. Such synthetic data have both similarities to and differences from
the augmented data used in deep learning, e.g. f2H. Both synthetic data and
augmented data aim at improving the generalisation capacity of classifiers.
Nevertheless, our methodology for generating synthetic data introduces more
deformed patterns than the simple label-preserving transformations used in data
augmentation.

Synthetic data are created to highlight the potentially useful patterns that
exist in real images. We generate synthetic data in two stages. In the first
stage, for each real data instance used to train MCAE, a synthetic version that
best matches the appearance of the real data is generated; thus pairs of
corresponding real and synthetic data can be used to train the MCAE. In the second
stage, more synthetic data are derived from the synthetic data generated in the
first stage by both interpolation and extrapolation. To distinguish the sets of
synthetic data produced in these two stages, we use the abbreviations Syn I and
Syn II respectively.

In the proposed approach, the synthetic data are represented by a parametric
model consisting of a set of control points and the edges associated with these
points in the images. From the control points, synthetic images can be generated
that simulate the real images by having the same structure or a similar
appearance. Initially, the control points are selected from a synthetic prototype
that generalizes all images in the same class. Then the locations of the control
points are iteratively optimized until convergence in order to minimize the
distance between the synthetic image generated from the control points and the
real image. We denote the control points and their associated edges as
$S = \{P, E\}$, where $P = \{p_i\}_{i=1}^{n}$ is the set of control points and
$E = \{(p_i, p_j)\}$, $1 \le i, j \le n$, is the set of edges connecting control
points. A generalized algorithm for obtaining the best matching synthetic image is
provided in Algorithm 1.

Algorithm 1 Get Matching Synthetic Image.

Input:

• A real image U.

• A set of control points S = {P, E} with all control points
p_i ∈ P set to their initial positions.

• A prototype image V generated using the initial S.

1: while S is not converged do
2:   S = OptimizeControlPoints(U, V, S).
3:   Generate V using S.
4: end while
5: Generate synthetic image I using S.
6: return I.


The synthetic prototype can either be manually designed or learned from the given
data, depending on the task. We show how these two methods produce synthetic data
in the following two sections respectively.

4.1 Explicit Design of the Synthetic Prototype

The generation of the synthetic prototype and control points in this scenario is
inspired by the approach proposed by Zhang et al. [49]. In their work, given
enough prior knowledge about the 3D objects, a synthetic prototype of the 3D
objects is explicitly designed and built. By adjusting the control points of the
prototype, various kinds of 3D objects are generated. Our data are very similar
to theirs in the sense that roof images share many characteristics, such as ridge
lines, valley lines and the intersections between these lines, which makes it
possible to manually design synthetic prototypes that characterize these
patterns. Based on this observation, a synthetic roof prototype can be generated
by setting the control points at the intersections of the ridge or valley lines
and drawing segments connecting these control points (see footnote 2).

In this scenario, the OptimizeControlPoints(U, V, S) function of Alg. 1 becomes a
process that searches for the control point locations whose synthetic image
minimizes the discrepancy between the real image and the synthetic image. A
coordinate descent framework is employed to accelerate the search process.
We summarize this method in Alg. 2.
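
The coordinate descent search, wrapped in the outer convergence loop of Algorithm 1, can be sketched as follows. This is a minimal illustration rather than the authors' code: the callables `render` (control points to synthetic image) and `dist` (image discrepancy) are placeholders supplied by the caller, trying the four axis-aligned unit moves is an assumption (the text only says each point is moved "by one unit"), and convergence is taken to mean that a full pass accepts no move.

```python
def optimize_control_points(target, points, render, dist):
    """One coordinate-descent pass: try unit moves of each control point and keep
    a move only if it reduces dist(target, render(points))."""
    points = list(points)
    best = dist(target, render(points))
    moved = False
    for i in range(len(points)):
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            x, y = points[i]
            points[i] = (x + dx, y + dy)       # tentative unit move
            d = dist(target, render(points))
            if d < best:
                best, moved = d, True          # keep the move
            else:
                points[i] = (x, y)             # cancel the move
    return points, moved

def fit_control_points(target, points, render, dist, max_iters=1000):
    """Repeat passes until no move is accepted (or max_iters is reached)."""
    for _ in range(max_iters):
        points, moved = optimize_control_points(target, points, render, dist)
        if not moved:
            break
    return points

# Toy usage: the "image" is just the point list and the discrepancy is the
# summed squared distance between corresponding points.
toy_render = lambda pts: pts
toy_dist = lambda a, b: sum((ax - bx) ** 2 + (ay - by) ** 2
                            for (ax, ay), (bx, by) in zip(a, b))
fitted = fit_control_points([(3, 4), (10, 2)], [(0, 0), (8, 5)], toy_render, toy_dist)
```

In a real setting `render` would rasterize the edges between control points and `dist` would compare the result with the real roof edge image; both are left abstract here because the text does not specify them.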

4.2 Learning Synthetic Prototype from Data 

In the handwritten digit dataset used in this work, we learn a synthetic
prototype from the given data. A digit prototype is generated for all images of
the same digit. The congealing algorithm proposed in [27] is employed in this step
to produce the synthetic prototypes for the digits. In congealing, projective
transformations are applied to the images to minimize a joint entropy. Thus the
prototype

2. In our experiments, classification of the roof images is essentially 
similar to that of HU. Our approach recognizes the style of the roofs 
based on edges extracted from the roof images. For more visualisation 
results, please refer to our supplementary material. 


Algorithm 2 OptimizeControlPoints(U, V, S) Case 1

Input:

• A real image U.

• A prototype of the synthetic image S = {P, E}.

• A synthetic image V generated using S.

1: for p_i ∈ P, 1 ≤ i ≤ n do
2:   Update S by moving p_i by one unit.
3:   Generate V using S.
4:   if S does not reduce Dist(U, V) then
5:     Cancel the last move of p_i.
6:     Generate V using S.
7:   end if
8: end for
9: return S.


is considered to be an average image of all images after congealing.

Then control points are evenly sampled from the boundary detected in the
prototype image. The control points need to be mapped to each digit image in
order to generate a synthetic image. To find this mapping, we implement an
approach that migrates the control points from the prototype image to the
destination image.

This point migration algorithm is based on a series of intermediate images
generated between the synthetic prototype and the destination image. To generate
the intermediate images, we binarize all the images and compute the distance
transformed images 0 of the synthetic prototype and the real image. Given the
number of steps, an intermediate image is then generated as the binarized linear
interpolation between the two distance transformed images. In each step, the
control points are snapped to the closest boundary pixels of the intermediate
image. The algorithm of OptimizeControlPoints(U, V, S) in this situation is given
in Algorithm 3; we fix the number of steps to 10 in this algorithm.
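
The migration procedure can be illustrated with a small NumPy sketch. Several details are assumptions made only for this example: the distance transform is taken to be the Euclidean distance to the nearest foreground pixel (computed by brute force, so it only suits small images), the intermediate image is binarized with a hypothetical threshold `thresh`, points are snapped to the nearest foreground pixel rather than to an explicitly traced boundary, and steps whose intermediate image is empty are skipped.

```python
import numpy as np

def distance_transform(mask):
    """Euclidean distance from every pixel to the nearest True pixel of mask.
    Brute force; assumes mask contains at least one True pixel."""
    fg = np.argwhere(mask)                              # (f, 2) foreground (row, col)
    grid = np.indices(mask.shape).reshape(2, -1).T      # (h*w, 2) all pixel coordinates
    d = np.sqrt(((grid[:, None, :] - fg[None, :, :]) ** 2).sum(axis=2))
    return d.min(axis=1).reshape(mask.shape)

def migrate_points(points, mask_start, mask_end, steps=10, thresh=0.5):
    """Move control points from the shape in mask_start towards the shape in
    mask_end by snapping them to a sequence of interpolated intermediate shapes."""
    dt_start = distance_transform(mask_start)
    dt_end = distance_transform(mask_end)
    pts = np.asarray(points, dtype=float)
    for i in range(1, steps + 1):
        w = i / steps
        inter = ((1 - w) * dt_start + w * dt_end) <= thresh   # binarized intermediate image
        fg = np.argwhere(inter)
        if len(fg) == 0:
            continue                                          # nothing to snap to at this step
        d2 = ((pts[:, None, :] - fg[None, :, :]) ** 2).sum(axis=2)
        pts = fg[d2.argmin(axis=1)].astype(float)             # snap to nearest foreground pixel
    return pts

# Toy usage: a vertical stroke in column 2 migrates to a stroke in column 4.
start = np.zeros((7, 7), dtype=bool); start[:, 2] = True
end = np.zeros((7, 7), dtype=bool); end[:, 4] = True
moved = migrate_points([(1, 2), (5, 2)], start, end)
```

At the last step the interpolated image equals the destination's distance transform, so with a threshold below 1 the points end on the destination shape. A library routine such as SciPy's `scipy.ndimage.distance_transform_edt` would be the practical choice for full-size images.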




Figure 4. Illustration of the migration of control points and of the
intermediate synthetic images generated using the control points in each step.
The distance transform images of the synthetic prototype and the real image are
shown as the leftmost and rightmost images respectively.

To generate the Syn II dataset, we either interpolate or extrapolate between the
control points of two randomly chosen synthetic images from the Syn I dataset.
The weights used in interpolation and extrapolation are drawn uniformly
from 0 to 1.
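
A minimal sketch of this second stage is given below. The interpolation is the usual convex combination of corresponding control points; the extrapolation formula is an assumption, since the text does not state it, and is taken here to continue past the second point set by the sampled weight. The function names and the even choice between the two modes are likewise hypothetical.

```python
import random

def blend_control_points(points_a, points_b, w, extrapolate=False):
    """Combine two control-point sets with the same number and order of points.

    Interpolation:          a + w * (b - a), a point between a and b.
    Extrapolation (assumed): a + (1 + w) * (b - a), a point beyond b.
    """
    t = 1.0 + w if extrapolate else w
    return [(ax + t * (bx - ax), ay + t * (by - ay))
            for (ax, ay), (bx, by) in zip(points_a, points_b)]

def sample_syn2(syn1, rng=random):
    """Draw one Syn II control-point set from a list of Syn I control-point sets."""
    a, b = rng.sample(syn1, 2)            # two distinct Syn I instances
    w = rng.uniform(0.0, 1.0)             # weight drawn uniformly from 0 to 1
    return blend_control_points(a, b, w, extrapolate=rng.random() < 0.5)

# Usage with two toy control-point sets of the same class.
syn1 = [[(0.0, 0.0), (4.0, 0.0)], [(2.0, 4.0), (6.0, 4.0)]]
new_points = sample_syn2(syn1)
```

Because only control points are blended, the edge set E can be reused unchanged when rendering the new synthetic image.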

5 Experiments and Results 

We validate the proposed MCAE model on several applications in this section.
This section is organised as follows. First, in Sec. 5.1, we introduce a new
benchmark dataset, the Satellite Roof Classification (SRC) dataset, to the
vision community. This dataset provides high-quality



















Algorithm 3 OptimizeControlPoints(U, V, S) Case 2

Input:

• A real image U.

• A prototype of the synthetic image S = {P, E}.

• A synthetic image V.

1: steps = 10.
2: Compute the distance transform images of U, V as U', V'.
3: for i = 1 to steps do
4:   $I = (1 - \frac{i}{steps})\, U' + \frac{i}{steps}\, V'$.
5:   I = Binarize(I).
6:   Update S by snapping to the closest boundary pixel on I.
7: end for
8: Set the status of S to be converged.
9: return S.


roof class labels for the satellite images. We also briefly
summarize the handwritten digits dataset used in our
paper. We explain the experimental settings in Sec. 5.2
and discuss the experimental results in Sec. 5.3.

5.1 Experiment Datasets 

5.1.1 Satellite Roof Classification (SRC) Dataset 

One particularly interesting problem for learning classifiers from synthetic data
is the analysis of satellite images of the Earth. Such problems generally need
very high quality (expert-level) labeled data. However, there is no previous
dataset for such research purposes. To facilitate the study, a new benchmark
Satellite Roof Classification (SRC) Dataset is created and used in our
experiments. Given a satellite image, we employ the method described in [48] to
crop roof images by registering artificial building footprints with the satellite
image. All roof images are then aligned along their footprint principal
directions using the method proposed in [50] and scaled to a resolution of
128 x 256. Two experts were invited to label 6 different roof styles: flat, gable,
gambrel, halfhip, hip and pyramid. Example instances of the SRC dataset are shown
in Fig. 5.

This dataset poses great challenges for visual analysis. First, the quality of
some satellite images is degraded by significant blurring that occurred when the
images were captured. Second, roofs in these images are covered by various kinds
of equipment such as air conditioners, chimneys and water tanks, and most of the
roofs in our dataset are partially occluded by shadows cast by trees and other
objects. Such coverings and shadows are great obstacles to robust visual analysis
algorithms. Furthermore, the class instances of the SRC dataset are naturally
extremely imbalanced, since some particular types of roofs (such as gambrel and
pyramid) are far less common than the other types in the real world. This
unbalanced distribution of data is shown in Table 1.

Classification of the roof styles in the experiments
is based on recognizing edges detected from the roof
images. We employed the adaptive Otsu edge detection
method [29] to extract edges from the roof images. We


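| Styles  | Training # | Testing # | Total # |
|---------|------------|-----------|---------|
| Flat    | 1232       | 1748      | 3080    |
| Gable   | 1111       | 1665      | 2776    |
| Gambrel | 156        | 232       | 388     |
| Halfhip | 268        | 400       | 668     |
| Hip     | 960        | 1440      | 2400    |
| Pyramid | 133        | 199       | 332     |

Table 1. The distribution of the roof styles used in the experiments.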


create a synthetic prototype to characterize the primary ridge lines or valley
lines of each roof style. Examples of the synthetic prototypes are shown in
Fig. 6. Real roof edge images and the matching Syn I images are shown in Fig. 1.
To create the Syn II images of this dataset, 2000 synthetic images are produced
for each roof style by interpolation and extrapolation between images in Syn I.



Figure 6. Illustration of the synthetic roof prototypes used in the process of
finding the matching synthetic data for each real data instance. There are two
types of control points: red ones and blue ones. Blue control points are
constrained to move along the boundary only. The area in which a point can move
is masked in light blue.


5.1.2 Handwritten Digits Dataset

We also validate our framework on the handwritten digits dataset from the UCI
machine learning repository [1], which has 5620 instances in total. The
handwritten digits from 0 to 9 in this dataset were collected from 43 people: 30
contributed to the training set and the other 13 to the test set. In the
experiments, the Syn I data are generated using Algorithm 3. The Syn II data of
this dataset are generated using interpolation and extrapolation as described in
Sec. 4.

5.2 Experimental Settings 

We fix the configuration of MCAE as $(i:X_s,\ t:X_r)^L$ (left channel) and
$(i:X_r,\ t:X_r)^R$ (right channel). Specifically, the left channel is the
reconstruction process from synthetic data to real data, while the right channel
works in the same way as a standard SAE. Our experimental results will show that
the representations learned in this way greatly benefit the performance of the
classifiers we compared.

In the experiments (see footnote 3), two different classifiers that utilize the
representations learned by MCAE (from Sec. 3) are compared. In the first
scenario, MCAE encodes the
input data to a representation (feature) in the hidden
3. All code (including our MCAE and the creation of synthetic data) will
be released once accepted.

Figure 5. The illustration of the SRC dataset.


layer, and an SVM with an RBF kernel is employed to show the classification
performance. In the second scenario, MCAE takes the input images and produces the
reconstructed images at the output layer. The features in this case are images
and can therefore be fed to a Convolutional Neural Network (CNN) for
classification. In our experiments we build a LeNet-5 [24], which was originally
created for digit recognition. We show that, using the same number of input data,
the CNN performs better on the data produced by the MCAE.

We summarize all evaluations and comparisons using the
F1 score, which is defined as:

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (7)$$
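
As a small worked example of Eq. (7), the following plain-Python sketch computes the F1 score of one class from true and predicted labels. The function name and the toy labels are illustrative only, and how the per-class scores are aggregated over the roof styles or digits is not stated in the text, so no averaging is shown.

```python
def f1_score(y_true, y_pred, positive):
    """F1 score of the class `positive`, computed from label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0                       # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 2 true positives, 1 false positive, 1 false negative for "hip",
# so precision = recall = 2/3 and F1 = 2/3.
y_true = ["hip", "hip", "flat", "flat", "hip"]
y_pred = ["hip", "flat", "flat", "hip", "hip"]
score = f1_score(y_true, y_pred, "hip")
```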


5.3 Evaluation (see footnote 4)

MCAE is better than CIAE and SAE. To better evaluate the performance of the
proposed MCAE, we compare MCAE with the Concatenate-Input Autoencoder (CIAE) [28]
and the Sparse Autoencoder (SAE) [43]. In these experiments, we evaluate the
performance with two classifiers: a CNN using reconstructed images and an SVM
using the encoded hidden layer representation. We present the results of these
comparisons in Table 2 and Table 3 for the SRC and handwritten digit datasets
respectively. It can be observed from these two tables that, although the
performance of the CIAE is close to that of MCAE, the proposed MCAE matches or
outperforms CIAE and SAE in every comparison reported in the two tables.

Synthetic data help learn a better classifier. We
designed another group of experiments. In these experiments,
three different configurations of data are either
reconstructed or encoded using the proposed MCAE, and
then used to train a CNN or an SVM.
All results from these experiments are compared in Table 4
and Table 5 respectively. An interesting thing to notice
is that in these experiments, using synthetic data alone can
achieve the same result as using a combination of real

4. Due to the page limit, please refer to our supplementary material 
for the additional experimental results. 



Autoencoder   Data to train autoencoder                     CNN (Reconstructed)   SVM (Encoded)
MCAE          (i: Syn I, t: Real) L; (i: Real, t: Real) R   0.68                  0.80
CIAE          (i: Syn I + Real, t: Syn I + Real)            0.68                  0.78
SAE           (i: Syn I, t: Syn I)                          0.63                  0.59
SAE           (i: Real, t: Real)                            0.62                  0.62

Table 2

F1 score of roof style classification using reconstructed
images (in CNN) and encoded image features (in SVM).
The second column shows the data used to train the
autoencoder in the first column. In classification,
Real+Syn II is used to train the CNN and SVM.
Syn I + Real means that we use the concatenation of the
Syn I and real images as the input for the corresponding
autoencoders.



Autoencoder   Data to train autoencoder                     CNN (Reconstructed)   SVM (Encoded)
MCAE          (i: Syn I, t: Real) L; (i: Real, t: Real) R   0.98                  0.96
CIAE          (i: Syn I + Real, t: Syn I + Real)            0.97                  0.96
SAE           (i: Syn I, t: Syn I)                          0.94                  0.91
SAE           (i: Real, t: Real)                            0.95                  0.65

Table 3

F1 score of handwritten digit recognition.


and synthetic data. This result suggests that the distribution
of the real data in this case almost overlaps
with the distribution of the synthetic data.

MCAE bridges the synthetic gap. We compare the
correlation defined as:

$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} \qquad (8)$$

between real and Syn I data before and after being
reconstructed by the MCAE. The intention of these comparisons
is to show that real and synthetic images become
Figure 7. t-SNE visualization of the synthetic gap bridged by MCAE. (a) Data distributions of each class of the SRC
dataset. For many data instances, the real (circles) and synthetic (dots) data do not overlap. This is the synthetic
gap. (b) Data distributions of the images reconstructed by MCAE for each class of the SRC dataset. The reconstructed
real (circles) and synthetic (dots) images almost overlap, which means that our MCAE can bridge the
synthetic gap.


Classifier   Feature type    Real   Syn II   Real+Syn II
CNN          Reconstructed   0.65   0.68     0.68
SVM          Encoded         0.77   0.78     0.80

Table 4

F1 score of roof style classification by classifier (CNN
and SVM) using different sets of data reconstructed or
encoded using the proposed MCAE.



Classifier   Feature type    Real   Syn II   Real+Syn II
CNN          Reconstructed   0.94   0.96     0.96
SVM          Encoded         0.96   0.96     0.98

Table 5

F1 score of handwritten digit recognition.


much more alike in appearance
after being reconstructed by the MCAE. The results are
shown in Fig. 8. Our method achieves almost
100% correlation between real and Syn I when
both are reconstructed by the proposed MCAE.
That means the proposed MCAE bridges the synthetic
gap between the real data and the synthetic data. A
visualization is shown in Fig. 7. It intuitively shows that our
MCAE can help bridge the synthetic gap between real
and synthetic data.
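
For concreteness, the correlation measure can be sketched as below, assuming Eq. 8 denotes the usual Pearson correlation coefficient and that each image is flattened into a list of pixel intensities. The pixel values are made up for illustration and are not taken from the SRC or digit data.

```python
import math

def pearson_correlation(x, y):
    """Cov(X, Y) / sqrt(Var(X) * Var(Y)) for two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical flattened pixel intensities of a real image and its Syn I match.
real = [0.0, 0.2, 0.9, 1.0]
synthetic = [0.1, 0.3, 0.7, 0.8]
rescaled = [2 * v + 1 for v in real]       # an affine copy of the real image

r_pair = pearson_correlation(real, synthetic)
r_affine = pearson_correlation(real, rescaled)
```

Because the measure is invariant to positive affine rescaling, `r_affine` is 1 up to rounding, so a value near 100% indicates matching appearance patterns rather than identical pixel values.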

6 Conclusion 

In this paper we identify the problem of the synthetic
gap. By solving this problem, we
demonstrate in our experiments that synthetic data can be used to
improve the performance of classifiers. To better learn
classifiers from synthetic data, we have proposed a novel
Multichannel Autoencoder (MCAE) model. MCAE has
multiple channels in its structure and is an extension
of the standard autoencoder. We show that MCAE not




Figure 8. Correlation between real and corresponding 
best matching Syn I data. 

only bridges the synthetic gap between real data and
synthetic data, but also jointly learns from both real and
synthetic data, and can thus provide a more robust representation
for both. To facilitate the study of satellite
image analysis, we introduce a novel benchmark
dataset, SRC, as one of the datasets used in our experiments.
The proposed method has been validated on the SRC and
handwritten digit datasets.
Supplementary material

6.1 Optimization of MCAE 

With two branches in the MCAE, we aim to minimize
the reconstruction error of the two tasks together while
taking into account the balance between the two branches.

The new objective function of the MCAE is given as
follows:

$$E = J^L(\theta_e, \theta_d^L) + J^R(\theta_e, \theta_d^R) + \gamma \Psi \qquad (9)$$

where

$$\Psi = \frac{1}{2}\left(J^L(\theta_e, \theta_d^L) - J^R(\theta_e, \theta_d^R)\right)^2 \qquad (10)$$

is a regularization term added to balance the learning rate
between the two branches. This regularization has
two effects on the MCAE. First, $\Psi$ accelerates the
optimization of Eq. 9, since minimizing $\Psi$ requires both
$J^L(\theta_e, \theta_d^L)$ and $J^R(\theta_e, \theta_d^R)$ to be small, which in turn causes
$E$ to decrease faster. Second, $\Psi$ penalizes more heavily the situation
in which the difference in learning error between the two branches
is large, so as to avoid imbalanced learning between
the two branches.

The minimization of Eq. 9 is achieved by back propagation
and stochastic gradient descent using a Quasi-Newton
method. In the MCAE, with the balance regularization
added to the objective, the only difference compared
with the sparse autoencoder is the gradient computation
of the unknown parameters $\theta_e$ and $\theta_d^L, \theta_d^R$. We clarify these
differences in the following equations:


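$$\begin{aligned}
\nabla_{W_e} E &= \frac{\partial J^L}{\partial W_e} + \frac{\partial J^R}{\partial W_e} + \gamma (J^L - J^R)\left(\frac{\partial J^L}{\partial W_e} - \frac{\partial J^R}{\partial W_e}\right) \\
\nabla_{b_e} E &= \frac{\partial J^L}{\partial b_e} + \frac{\partial J^R}{\partial b_e} + \gamma (J^L - J^R)\left(\frac{\partial J^L}{\partial b_e} - \frac{\partial J^R}{\partial b_e}\right)
\end{aligned} \qquad (11)$$

and

$$\begin{aligned}
\nabla_{W_d^L} E &= \frac{\partial J^L}{\partial W_d^L} + \gamma (J^L - J^R)\frac{\partial J^L}{\partial W_d^L} \\
\nabla_{b_d^L} E &= \frac{\partial J^L}{\partial b_d^L} + \gamma (J^L - J^R)\frac{\partial J^L}{\partial b_d^L} \\
\nabla_{W_d^R} E &= \frac{\partial J^R}{\partial W_d^R} + \gamma (J^L - J^R)\left(-\frac{\partial J^R}{\partial W_d^R}\right) \\
\nabla_{b_d^R} E &= \frac{\partial J^R}{\partial b_d^R} + \gamma (J^L - J^R)\left(-\frac{\partial J^R}{\partial b_d^R}\right)
\end{aligned} \qquad (12)$$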


The exact form of the gradients of $\theta_e$ and $\theta_d^L, \theta_d^R$ varies
according to the sparsity regularization used in
the framework.
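
The structure of these gradients can be checked numerically on a toy problem. The sketch below is only an illustration under strong assumptions: each branch is reduced to scalar weights (`w_e` for the shared encoder, `w_l` and `w_r` for the two decoders), the branch losses are plain mean squared errors with no sparsity term, and all names and data values are hypothetical. It takes the balance term to be half the squared difference of the two branch losses, scaled by `gamma`.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def branch_losses(w_e, w_l, w_r, syn, real):
    """Left branch maps synthetic -> real; right branch maps real -> real."""
    j_l = mean((w_l * w_e * s - r) ** 2 for s, r in zip(syn, real))
    j_r = mean((w_r * w_e * r - r) ** 2 for r in real)
    return j_l, j_r

def objective(w_e, w_l, w_r, syn, real, gamma):
    """E = J_L + J_R + gamma * 0.5 * (J_L - J_R)^2."""
    j_l, j_r = branch_losses(w_e, w_l, w_r, syn, real)
    return j_l + j_r + gamma * 0.5 * (j_l - j_r) ** 2

def gradients(w_e, w_l, w_r, syn, real, gamma):
    """Analytic gradients of E with respect to (w_e, w_l, w_r)."""
    j_l, j_r = branch_losses(w_e, w_l, w_r, syn, real)
    dl_e = mean(2 * (w_l * w_e * s - r) * w_l * s for s, r in zip(syn, real))
    dl_l = mean(2 * (w_l * w_e * s - r) * w_e * s for s, r in zip(syn, real))
    dr_e = mean(2 * (w_r * w_e * r - r) * w_r * r for r in real)
    dr_r = mean(2 * (w_r * w_e * r - r) * w_e * r for r in real)
    gap = gamma * (j_l - j_r)
    grad_e = dl_e + dr_e + gap * (dl_e - dr_e)   # shared encoder sees both branches
    grad_l = dl_l + gap * dl_l                   # left decoder sees only J_L
    grad_r = dr_r - gap * dr_r                   # right decoder sees only J_R
    return grad_e, grad_l, grad_r

def numeric_gradients(w_e, w_l, w_r, syn, real, gamma, eps=1e-6):
    """Central finite differences, used only to check the analytic form."""
    params = [w_e, w_l, w_r]
    grads = []
    for i in range(len(params)):
        plus, minus = list(params), list(params)
        plus[i] += eps
        minus[i] -= eps
        diff = objective(*plus, syn, real, gamma) - objective(*minus, syn, real, gamma)
        grads.append(diff / (2 * eps))
    return tuple(grads)

syn = [0.1, 0.4, 0.8, 1.0]     # hypothetical synthetic inputs
real = [0.2, 0.5, 0.7, 0.9]    # hypothetical paired real targets
analytic = gradients(0.5, 1.2, 0.8, syn, real, gamma=0.3)
numeric = numeric_gradients(0.5, 1.2, 0.8, syn, real, gamma=0.3)
```

The analytic and finite-difference gradients agree to within numerical error. The sign flip on the right decoder's balance term comes from differentiating the squared difference with respect to the right-branch loss.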


7 Generating synthetic data 

An example in Fig. 9 shows how control points are
moved from the source image (the one with the blue boundary)
to the destination image (the one with the red boundary). It
can be observed that most of the control points are
moved from the source image to the corresponding locations
on the destination image. In this step, it is not necessary for
all control points to move accurately to the exact corresponding
location on the destination image. Our goal is just to use
these migrated control points to generate synthetic data
that roughly mimic the real data. Our MCAE will later
rectify the difference between synthetic data and real
data.



Figure 9. An example of migration of the control points 
from source image (blue) to destination image (red). 


7.1 Further Validation for MCAE 

Note that directly applying a sparse autoencoder to our
problem does not work well. For example, we can train
an autoencoder purely by placing synthetic data in the input
layer and real data in the output layer; this, however,
cannot bridge the synthetic gap in our problem. Such a
reconstruction only complements the synthetic data with the
information they are missing relative to the real data. In
contrast, reconstructing real data using such an SAE will
add unnecessary information and noisy patterns to the
reconstructed data.

To validate this point, we extend the experiments on the
two datasets and show that SAE cannot bridge the
synthetic gap. The results are shown in Fig. 10. The
data reconstructed by SAE have lower average divergence
than those of the other methods. That means SAE performs
worse than MCAE in bridging the synthetic gap.
Figure 10. MCAE almost perfectly bridges the synthetic
gap and is much better than SAE at this task.